This dataset consists of tv shows and movies available on Netflix. The dataset is collected from Flixable which is a third-party Netflix search engine.
In 2018, they released an interesting report which shows that the number of TV shows on Netflix has nearly tripled since 2010. The streaming service’s number of movies has decreased by more than 2,000 titles since 2010, while its number of TV shows has nearly tripled. It will be interesting to explore what all other insights can be obtained from the same dataset.
Main Task
Build a recommended engine to return maximum 10 movie names if an user search for a particular movie.
Interesting questions
Steps:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('netflix_titles.csv')
df_original = df.copy()
print("dataframe shape :", df.shape)
df.head()
dataframe shape : (7787, 12)
| show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | TV Show | 3% | NaN | João Miguel, Bianca Comparato, Michel Gomes, R... | Brazil | August 14, 2020 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV Sci-Fi &... | In a future where the elite inhabit an island ... |
| 1 | s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, ... | Mexico | December 23, 2016 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits Mexico Cit... |
| 2 | s3 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, Lawrence ... | Singapore | December 20, 2018 | 2011 | R | 78 min | Horror Movies, International Movies | When an army recruit is found dead, his fellow... |
| 3 | s4 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer Connelly... | United States | November 16, 2017 | 2009 | PG-13 | 80 min | Action & Adventure, Independent Movies, Sci-Fi... | In a postapocalyptic world, rag-doll robots hi... |
| 4 | s5 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... | United States | January 1, 2020 | 2008 | PG-13 | 123 min | Dramas | A brilliant group of students become card-coun... |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7787 entries, 0 to 7786 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 show_id 7787 non-null object 1 type 7787 non-null object 2 title 7787 non-null object 3 director 5398 non-null object 4 cast 7069 non-null object 5 country 7280 non-null object 6 date_added 7777 non-null object 7 release_year 7787 non-null int64 8 rating 7780 non-null object 9 duration 7787 non-null object 10 listed_in 7787 non-null object 11 description 7787 non-null object dtypes: int64(1), object(11) memory usage: 730.2+ KB
df['type'].value_counts().index
Index(['Movie', 'TV Show'], dtype='object')
Program types on netflix
fig = px.pie(df,values=df['type'].value_counts().values,names=df['type'].value_counts().index,
title='Movie vs. TV Show on Netflix',
color_discrete_sequence= px.colors.qualitative.Bold)
fig.update_traces(textposition='inside', textinfo='percent+label',
marker_line_width=3, marker_line_color='white' )
fig.show()
Movie to TV contents ratio is 70:30 on netflix.
Top 20 countries with most content
df_country = df.copy()
df_country = df.groupby(['country','type']).count()['show_id'].reset_index()
df_country.rename(columns={'show_id':'count'},inplace=True)
# transform the country col with only 1 country in each row
#df_country = pd.concat([pd.Series(row['count'],row['country'].split(', ')) for _,row in df_country.iterrows()]).reset_index()
df_country = pd.concat([pd.DataFrame([[row['type'],row['count']]],row['country'].split(', ')) for _,row in df_country.iterrows()]).reset_index()
df_country.rename(columns={'index':'country',0:'type',1:'count'},inplace=True)
df_country = df_country.groupby(['country','type']).sum()['count'].reset_index()
df_country.head(1)
| country | type | count | |
|---|---|---|---|
| 0 | Afghanistan | Movie | 1 |
# build the top 20 list
df_country_20 = df_country.groupby(['country']).sum().sort_values(by='count',ascending=False).head(20)
top_20_lst = list(df_country_20.index)
#top_20_lst
# shortlist the dataframe to match the top 20 list
top_20_df = df_country[df_country['country'].apply(lambda x: x in top_20_lst)]
# plot
fig = px.bar(top_20_df,'country','count',color='type',title='Top 20 countries with most content',height=500,
category_orders = {'country':top_20_lst},
color_discrete_sequence= px.colors.qualitative.Bold)
fig.show()
Rank #1 is the US, followed by India and the UK. The US is more than 3 times of India in programs count.
The result is quite insteresting for Movie vs TV program split. Movie got higher share in US, but opposite result seen in Japan, South Korea, Taiwan - TV contents showed higher share.
Target audience in the top 10 countries
def target(rating):
#4 categories
target_lst = {
'TV-PG': 'Older Kids',
'TV-MA': 'Adults',
'TV-Y7-FV': 'Older Kids',
'TV-Y7': 'Older Kids',
'TV-14': 'Teens',
'R': 'Adults',
'TV-Y': 'Kids',
'NR': 'Adults',
'PG-13': 'Teens',
'TV-G': 'Kids',
'PG': 'Older Kids',
'G': 'Kids',
'UR': 'Adults',
'NC-17': 'Adults'
}
return target_lst[rating]
# prepare the dataframe for the plot
df_rating = df.copy()
#remove null rating values - 7 rows
df_rating.rating.isna().sum()
df_rating.dropna(subset=['rating'],inplace = True)
#create the target column
df_rating['target'] = df_rating['rating'].apply(target)
#df_rating.head(1)
# Prepare the country-rating df + trnasform the country col with only 1 country in each row
df_rating = df_rating.groupby(['country','target']).count()['show_id'].reset_index()
df_rating.rename(columns={'show_id':'count'},inplace=True)
df_rating = pd.concat([pd.DataFrame([[row['target'],row['count']]],row['country'].split(', ')) for _,row in df_rating.iterrows()]).reset_index()
df_rating.rename(columns={'index':'country',0:'target',1:'count'},inplace=True)
df_rating = df_rating.groupby(['country','target']).sum()['count'].reset_index()
df_rating.head(1)
| country | target | count | |
|---|---|---|---|
| 0 | Afghanistan | Adults | 1 |
# prepare top 10 list (by cropping the top 20 list built previously)
top_10_lst = top_20_lst[0:10]
# filter the dataframe to match the top 10 list
top_10_df = df_rating[df_rating['country'].apply(lambda x: x in top_10_lst)]
#pivot country x target
top_10_df = top_10_df.pivot_table(index='country',columns='target',values='count')
#normalize the values
top_10_df['total'] = top_10_df.sum(axis=1)
top_10_df = top_10_df.sort_values(by='total')
top_10_df = round(top_10_df.div(top_10_df['total'],axis=0)*100,1)
top_10_df_x = top_10_df.iloc[:,0:4].values
top_10_df_y = top_10_df.index
# prepare the plot
top_labels = ['Adults', 'Teens', 'Older Kids', 'Kids']
colors = ['rgba(38, 24, 74, 0.8)', 'rgba(71, 58, 131, 0.8)',
'rgba(122, 120, 168, 0.8)', 'rgba(164, 163, 204, 0.85)'
]
x_data = top_10_df_x
y_data = top_10_df_y
fig = go.Figure()
for i in range(0, len(x_data[0])):
for xd, yd in zip(x_data, y_data):
fig.add_trace(go.Bar(
x=[xd[i]], y=[yd],
orientation='h',
marker=dict(
color=colors[i],
line=dict(color='rgb(248, 248, 249)', width=1)
)
))
fig.update_layout(
xaxis=dict(
showgrid=False,
showline=False,
showticklabels=False,
zeroline=False,
domain=[0.15, 1]
),
yaxis=dict(
showgrid=False,
showline=False,
showticklabels=False,
zeroline=False,
),
barmode='stack',
paper_bgcolor='rgb(248, 248, 255)',
plot_bgcolor='rgb(248, 248, 255)',
margin=dict(l=120, r=10, t=140, b=80),
showlegend=False,
title='Target audience in the top 10 countries'
)
annotations = []
for yd, xd in zip(y_data, x_data):
# labeling the y-axis
annotations.append(dict(xref='paper', yref='y',
x=0.14, y=yd,
xanchor='right',
text=str(yd),
font=dict(family='Arial', size=14,
color='rgb(67, 67, 67)'),
showarrow=False, align='right'))
# labeling the first percentage of each bar (x_axis)
annotations.append(dict(xref='x', yref='y',
x=xd[0] / 2, y=yd,
text=str(xd[0]) + '%',
font=dict(family='Arial', size=11,
color='rgb(248, 248, 255)'),
showarrow=False))
# labeling the first Likert scale (on the top)
if yd == y_data[-1]:
annotations.append(dict(xref='x', yref='paper',
x=xd[0] / 2, y=1.1,
text=top_labels[0],
font=dict(family='Arial', size=14,
color='rgb(67, 67, 67)'),
showarrow=False))
space = xd[0]
for i in range(1, len(xd)):
# labeling the rest of percentages for each bar (x_axis)
annotations.append(dict(xref='x', yref='y',
x=space + (xd[i]/2), y=yd,
text=str(xd[i]) + '%',
font=dict(family='Arial', size=11,
color='rgb(248, 248, 255)'),
showarrow=False))
# labeling the rest Likert scale
if yd == y_data[-1]:
annotations.append(dict(xref='x', yref='paper',
x=space + (xd[i]/2), y=1.1,
text=top_labels[i],
font=dict(family='Arial', size=14,
color='rgb(67, 67, 67)'),
showarrow=False))
space += xd[i]
fig.update_layout(annotations=annotations)
fig.show()
Target audience in the top 10 countries
In general, Asian countries (South Korea, India, Japan) have lower proportion on adult contents (26% - 44%) while western countries have higher proportion on adults contents (42% - 79%). For the extreme cases, Spain has almost 80% occupied by adults content while india only has 26% adults content. Most content available in india is kids programs.(72%)
More about kid content in India:
The world of cartoons and kids’ content has caught the fancy of many over-the-top (OTT) streaming players in India. The top six to ten platforms including Netflix, Amazon Prime, Hotstar, Voot, Sony LIV and Eros Now — besides YouTube — have capitalised on the growing appetite for content created for kids or children in India. source: https://inc42.com/features/toons-and-beyond-indias-ott-players-take-baby-steps-into-edutainment-kids-content/
Let's have a look at the genres in US vs Korea
#South Korea
# func to remove those generic tags in genre
def remove_dup(genre):
lst = genre.split(', ')
remove_lst = ['International TV Shows','Korean TV Shows','International Movies','Dramas']
return(','.join(x for x in lst if x not in remove_lst))
df_sk = df[df.country=='South Korea']
df_sk['genre'] = df_sk['listed_in']
df_sk['genre'] = df_sk['genre'].apply(remove_dup)
# further filter genre col
df_sk.loc[df_sk['genre'].str.contains('Action & Adventure'),"genre"] = "Action & Adventure"
df_sk.loc[df_sk['genre'].str.contains('Romantic TV Shows'),"genre"] = "Romantic TV Shows"
df_sk.loc[df_sk['genre'].str.contains('Crime TV Shows'),"genre"] = "Crime TV Shows"
df_sk['genre'] = df_sk['genre'].apply(lambda x:x.split(',')[0])
df_sk.drop(df_sk[df_sk['genre']==''].index,axis=0,inplace=True)
# pie chart
fig = px.pie(df_sk,values=df_sk['genre'].value_counts().values,names=df_sk['genre'].value_counts().index,
title='Genres in South Korea',color_discrete_sequence= px.colors.qualitative.Bold,height=500)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
#US
df_us = df[(df.country.str.contains('United States')) & (df.country.notnull())]
df_us['genre'] = df_us['listed_in'].apply(lambda x:x.split(', ')[0])
fig = px.pie(df_us,values=df_us['genre'].value_counts().values,names=df_us['genre'].value_counts().index,
color_discrete_sequence= px.colors.qualitative.Bold,title='Genres in the US')
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()
What content is available in different countries? Let say US vs Korea?
For South Korea, most contents are romantic tv shows (40%), followed by Crime TV shows (13%). While for the US, the program types are quite diverse, main programs cover Dramas, documentaries, comedies, children movies,...etc.
TV vs Movie Trend - Release Year
#Release Year
df_year = df[df['release_year']>1999]
fig = px.histogram(df_year,x='release_year',facet_col='type',opacity=0.7,
category_orders = {'type':['Movie','TV Show']},
color='type',color_discrete_sequence= px.colors.qualitative.Bold,
title = "No. of programs by released year")
fig.show()
TV vs Movie Trend - Year_added to Netflix
#Year_added to Netflix
df['year_added'] = df['date_added'].str[-4:]
fig = px.histogram(df,x='year_added',facet_col='type',opacity=0.7,
category_orders = {'type':['Movie','TV Show']},
color='type',color_discrete_sequence= px.colors.qualitative.Bold,
title = "No. of programs by year added to Netflix")
fig.show()
# TV vs Movie across years
df_trend = df.groupby(['year_added','type']).count()['show_id'].reset_index()
df_trend.rename(columns={'show_id':'count'},inplace=True)
df_trend['total'] = df_trend.groupby('year_added')['count'].transform(lambda x:x.sum())
df_trend['percentage'] = df_trend['count']/df_trend['total']
df_trend['percentage'] = pd.Series(["{0:.0f}%".format(x * 100) for x in df_trend['percentage']],index=df_trend.index)
fig = px.bar(df_trend,'year_added','count',color='type',title='Movie vs. TV Show by year added to Netflix',height=500,text='percentage',
range_x = [2014,2022],color_discrete_sequence= px.colors.qualitative.Bold)
fig.update_xaxes(showgrid=True, ticks="outside")
fig.show()
Is Netflix has increasingly focusing on TV rather than movies in recent years?
The share between TV vs movie was similar in 2016 , then movie showed a big jump from 58% to 71% in 2017. Movie share reached the highest in 2018, occupying 74% of total. Yet, starting from 2019 TV share increased gradually, movie share dropped to 65% in 2020.
Probably due to covid-19, less movie launched and led to decrease in movie share?? Or simply because auidences' increasing interest in TV shows?!
#TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
df1 = df[df['type'] == 'Movie'].reset_index()
tfidf = TfidfVectorizer(analyzer='word',ngram_range=(1,2),min_df=1)
x = tfidf.fit_transform(df1.description)
#print(tfidf.get_feature_names())
from sklearn.metrics.pairwise import sigmoid_kernel
model = sigmoid_kernel(x,x)
def recommendations(title):
indices = pd.Series(data=df1.index,index=df1['title']).drop_duplicates()
idx = indices[title]
sig_scores = list(enumerate(model[idx]))
sig_scores = sorted(sig_scores, key=lambda x: x[1], reverse=True)
sig_scores = sig_scores[1:11]
idn = [i[0] for i in sig_scores]
print('Movies you might also like: \n ')
for i in range(0,10):
print(i+1, indices.iloc[idn].index[i])
#print(df1.iloc[idn].description.values[i],'\n')
recommendations('To All the Boys: P.S. I Still Love You')
Movies you might also like: 1 To All the Boys I’ve Loved Before 2 Daagdi Chaawl 3 Petta 4 Petta (Telugu Version) 5 Bhaji In Problem 6 Chris D'Elia: Incorrigible 7 Love You... Love You Not 8 Wildlife 9 The Clapper 10 Lez Bomb
#df1[df1['title'].str.contains('To All the Boys')]